Learning objectives

The three main objectives of this session are:

  1. What are the geographic levels in the UK? How do we access and plot them?

  2. What is spatial autocorrelation? How do we describe it?

  3. What is a gravity model? What data do we need to implement the simplest one?


UK Administrative Geographies


Worth noting about the Statistical Building Blocks is that they are derived from population counts, not areas. Below is an overview of the population thresholds used to create these geographies.

More about these population-weighted geographies here


Getting Data

Polygons

You can get geographic data for the UK from the open geography portal via an API call.

library(raster)
library(knitr)
library(geojsonio)
library(sp)
library(tmap)
library(spdep)
library(reshape2)
library(rsq)

#connect to the open geography portal API
regions_json <- geojson_read("https://opendata.arcgis.com/datasets/8d3a9e6e7bd445e2bdcc26cdf007eac7_1.geojson", what = "sp")
z_regions_json <- regions_json #keep an untouched backup copy
plot(regions_json)


Variables

I’m using table WU02EW - Location of usual residence and place of work by age.

#Connect to the NOMIS API to get the data
#Note: I'm only using England and Wales here for convenience.
t_wu02Ew  <- read.csv(file = "https://www.nomisweb.co.uk/api/v01/dataset/NM_1206_1.data.csv?date=latest&usual_residence=2013265921...2013265930&place_of_work=2013265921...2013265930&age=0...6&measures=20100", header=TRUE)
kable(head(t_wu02Ew))
DATE DATE_NAME DATE_CODE DATE_TYPE DATE_TYPECODE DATE_SORTORDER USUAL_RESIDENCE USUAL_RESIDENCE_NAME USUAL_RESIDENCE_CODE USUAL_RESIDENCE_TYPE USUAL_RESIDENCE_TYPECODE USUAL_RESIDENCE_SORTORDER PLACE_OF_WORK PLACE_OF_WORK_NAME PLACE_OF_WORK_CODE PLACE_OF_WORK_TYPE PLACE_OF_WORK_TYPECODE PLACE_OF_WORK_SORTORDER AGE AGE_NAME AGE_CODE AGE_TYPE AGE_TYPECODE AGE_SORTORDER MEASURES MEASURES_NAME OBS_VALUE OBS_STATUS OBS_STATUS_NAME OBS_CONF OBS_CONF_NAME URN RECORD_OFFSET RECORD_COUNT
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 0 Aged 16 and over 0 Age 1000 0 20100 Value 936525 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d0d20100 0 700
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 1 Aged 16-24 1 Age 1000 1 20100 Value 130827 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d1d20100 1 700
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 2 Aged 25-34 2 Age 1000 2 20100 Value 199584 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d2d20100 2 700
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 3 Aged 35-49 3 Age 1000 3 20100 Value 344176 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d3d20100 3 700
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 4 Aged 50-64 4 Age 1000 4 20100 Value 243426 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d4d20100 4 700
2011 2011 2011 date 0 0 2013265921 North East E12000001 regions 480 0 2013265921 North East E12000001 regions 480 0 5 Aged 65-74 5 Age 1000 5 20100 Value 15701 A Normal Value FALSE Free (free for publication) Nm-1206d1d32176e1d2013265921d2013265921d5d20100 5 700

This dataset arrives as a flat 2D table despite containing multi-way tabulated counts: by origin, destination and age group. To wrangle it into counts per geographic unit for use with the polygon data, we apply the following transformation:

#select only the columns we need for now
ac <- c("USUAL_RESIDENCE_CODE","AGE_NAME","OBS_VALUE")
t_wu02Ew_ac <-t_wu02Ew[,ac]

reg_var <- xtabs(OBS_VALUE ~ USUAL_RESIDENCE_CODE + AGE_NAME, data=t_wu02Ew_ac)
reg_var <- as.data.frame.matrix(reg_var) 
kable(reg_var)
Aged 16-24 Aged 16 and over Aged 25-34 Aged 35-49 Aged 50-64 Aged 65-74 Aged 75+
E12000001 137047 974625 208221 358197 252002 16252 2906
E12000002 386853 2705931 598669 986277 666548 57777 9807
E12000003 294930 2029907 442170 739320 505157 41209 7121
E12000004 247828 1777612 372762 656200 454732 39633 6457
E12000005 290056 2106075 458558 771101 526138 51665 8557
E12000006 315871 2299955 496908 834637 581497 60791 10251
E12000007 374191 3197606 1056357 1101460 591100 60959 13539
E12000008 461335 3391170 731448 1238930 852032 91961 15464
E12000009 292428 2024395 417547 716798 531510 56730 9382
W92000004 160618 1117784 238664 404170 284858 25071 4403

This data gives the working population by place of residence and age group. For simplicity, we will combine the seven age groups into two: let’s assume people over 65 could be retired, whilst all other ages can be expected to be working. This generates two variables:

retired <- c("Aged 65-74", "Aged 75+" )
#exclude the "Aged 16 and over" total column so working ages aren't double counted
working <- setdiff(names(reg_var), c(retired, "Aged 16 and over"))
reg_var$w_Age <- rowSums(reg_var[,working])
reg_var$r_Age <- rowSums(reg_var[,retired])

Combining the two

Although you could join your tables on region names, geography codes, where available, will give you a much cleaner merge.

#attach the data to the dataframe component of the Spatial data
#Note: 0 means attach by row names
regions_json@data <- merge(regions_json@data, reg_var, by.x= "rgn15cd", by.y=0 )
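The by.y = 0 trick merges against the counts table's row names (which xtabs produced from the region codes). A minimal base-R sketch of the same idea, reusing two r_Age values from the table above:

```r
#minimal sketch of merging by row names: by = 0 (or "row.names")
#tells merge() to use the data frame's row names as the join key
spatial_attrs <- data.frame(rgn15cd = c("E12000001", "E12000002"),
                            rgn15nm = c("North East", "North West"))
counts <- data.frame(r_Age = c(19158, 67584),
                     row.names = c("E12000001", "E12000002"))
merged <- merge(spatial_attrs, counts, by.x = "rgn15cd", by.y = 0)
merged$r_Age #counts now attached alongside the region attributes
```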

Plotting Your Maps

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(regions_json) + tm_polygons(col="r_Age")

So far we have covered the basics of UK geographies and data sources, including how to call and wrangle your data.



Spatial Autocorrelation

The two main points in flagging up spatial dependence are: firstly, defining who your neighbours are; and secondly, what your relationship to them is (expressed as weights).

One common way to define neighbours is by contiguity, i.e. sharing a common boundary. In practice this is checked by testing for at least one shared boundary point.
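As a minimal sketch (with a hypothetical four-region strip A-B-C-D), contiguity boils down to a binary, symmetric adjacency matrix with a zero diagonal, which functions like poly2nb() store as a neighbour list:

```r
#toy illustration: contiguity as a binary, symmetric adjacency
#matrix with zeroes on the diagonal (regions A-B, B-C, C-D touch)
regions <- c("A", "B", "C", "D")
W <- matrix(0, 4, 4, dimnames = list(regions, regions))
W["A", "B"] <- W["B", "A"] <- 1
W["B", "C"] <- W["C", "B"] <- 1
W["C", "D"] <- W["D", "C"] <- 1

#neighbour-list view, analogous to what poly2nb() returns
neighbours <- apply(W == 1, 1, function(row) regions[row])
neighbours$B #the neighbours of B: "A" "C"
```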


Moran’s I

\[ I=\frac{N}{W} \frac{\sum_{i} \sum_{j} w_{i j}\left(x_{i}-\overline{x}\right)\left(x_{j}-\overline{x}\right)}{\sum_{i}\left(x_{i}-\overline{x}\right)^{2}} \]

where
\(N\) is the number of spatial units indexed by \(i\) and \(j\)
\(x\) is the variable of interest
\(\overline{x}\) is the mean of \(x\)
\(w_{ij}\) is a matrix of spatial weights with zeroes on the diagonal (i.e., \(w_{ii}=0\))
\(W\) is the sum of all \(w_{ij}\)


This is an index that ranges from -1 to 1.

Dispersion images source.
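To make the formula concrete, here is a hand-rolled computation on a hypothetical four-region strip with binary contiguity weights (spdep's moran.test(), used below, does this plus the inference for you):

```r
#hand-rolled Moran's I on a toy four-region strip, following the formula:
#I = (N/W) * sum_ij w_ij (xi - xbar)(xj - xbar) / sum_i (xi - xbar)^2
x <- c(10, 12, 30, 31) #clustered values: a low pair next to a high pair
w <- matrix(0, 4, 4)
w[1,2] <- w[2,1] <- w[2,3] <- w[3,2] <- w[3,4] <- w[4,3] <- 1 #contiguity
N <- length(x); W <- sum(w)
z <- x - mean(x)
I <- (N / W) * sum(w * outer(z, z)) / sum(z^2)
I #positive: similar values sit next to each other
```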


Example

Defining your neighbours

#queen=TRUE: neighbours need only share a single boundary point
neighbs<- poly2nb(regions_json, queen=TRUE)
summary(neighbs)
## Neighbour list object:
## Number of regions: 9 
## Number of nonzero links: 30 
## Percentage nonzero weights: 37.03704 
## Average number of links: 3.333333 
## Link number distribution:
## 
## 2 3 4 5 
## 3 2 2 2 
## 3 least connected regions:
## 0 6 8 with 2 links
## 2 most connected regions:
## 3 7 with 5 links

Notice that we have links created for all nine regions. These can be accessed by:

regions_json$rgn15nm[1] # region
## [1] North East
## 9 Levels: East Midlands East of England London North East ... Yorkshire and The Humber
regions_json$rgn15nm[unlist(neighbs[[1]])] # its neighbours
## [1] North West               Yorkshire and The Humber
## 9 Levels: East Midlands East of England London North East ... Yorkshire and The Humber

Establishing your relationship to these
Now that the neighbours are defined, their relationship needs to be articulated. This is done with the style parameter. The simplest approach, which I will use here, is to assume equal weight for all neighbours:

lw <- nb2listw(neighbs, style="W")
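Under the hood, style = "W" row-standardises the binary contiguity weights: each region's neighbour weights are divided by its number of links, so every row sums to one. A base-R sketch on a toy three-region strip:

```r
#row standardisation (what style = "W" does), sketched in base R
binary <- matrix(c(0, 1, 0,
                   1, 0, 1,
                   0, 1, 0), nrow = 3, byrow = TRUE)
row_std <- binary / rowSums(binary) #divide each row by its link count
rowSums(row_std) #all 1: each of region 2's two neighbours now weighs 0.5
```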

Global Moran’s I
Having defined the neighbours and their relations, we can check our index and its corresponding p-value in two ways:

moran.test(regions_json@data$r_Age,lw)
## 
##  Moran I test under randomisation
## 
## data:  regions_json@data$r_Age  
## weights: lw    
## 
## Moran I statistic standard deviate = 1.5783, p-value = 0.05725
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.16636716       -0.12500000        0.03407958
#Permutation test for Moran's I statistic
MC<- moran.mc(regions_json@data$r_Age, lw, nsim=599)
MC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  regions_json@data$r_Age 
## weights: lw  
## number of simulations + 1: 600 
## 
## statistic = 0.16637, observed rank = 562, p-value = 0.06333
## alternative hypothesis: greater

Note that the above provides the same index but different p-values. In both cases, the spatial distribution of our retired group shows slight positive spatial autocorrelation: the group is slightly clustered, but only marginally significant (p ≈ 0.06, just above the conventional 0.05 threshold).

plot(MC, main="", las=1)


When dealing with many polygons, a simulation-based approach makes more sense. Brunsdon & Comber go into this in more detail. There are additional ways to visualise spatial autocorrelation, notably using variograms; you can find a useful starting point with code examples here.
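For intuition, what moran.mc() does can be hand-rolled on toy data: shuffle the values across regions, recompute \(I\) each time, and count how often a shuffled \(I\) matches or exceeds the observed one (all numbers below are hypothetical):

```r
#a hand-rolled permutation test for Moran's I on toy data
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}
x <- c(10, 12, 30, 31)
w <- matrix(0, 4, 4)
w[1,2] <- w[2,1] <- w[2,3] <- w[3,2] <- w[3,4] <- w[4,3] <- 1
set.seed(1)
obs  <- morans_i(x, w)
sims <- replicate(599, morans_i(sample(x), w)) #reshuffle values over regions
p <- (sum(sims >= obs) + 1) / (599 + 1) #same "+1" convention as moran.mc()
```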

Geary’s C

A less popular alternative to Moran’s I is Geary’s C, also known as Geary’s contiguity ratio or simply Geary’s ratio. It ranges from 0 to 2. Inversely to \(I\), lower values of \(C\) indicate clustering and higher values indicate dispersal. \(C\) looks very similar to \(I\):

\[ C=\frac{(N-1) \sum_{i} \sum_{j} w_{i j}\left(x_{i}-x_{j}\right)^{2}}{2 W \sum_{i}\left(x_{i}-\overline{x}\right)^{2}} \]

where
\(N\) is the number of spatial units indexed by \(i\) and \(j\)
\(x\) is the variable of interest
\(\overline{x}\) is the mean of \(x\)
\(w_{ij}\) is a matrix of spatial weights with zeroes on the diagonal (i.e., \(w_{ii}=0\))
\(W\) is the sum of all \(w_{ij}\)
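Again for intuition, here is a hand-rolled \(C\) on the same kind of hypothetical four-region strip (geary.test(), used below, adds the inference):

```r
#hand-rolled Geary's C, following the formula:
#C = (N-1) * sum_ij w_ij (xi - xj)^2 / (2 * W * sum_i (xi - xbar)^2)
x <- c(10, 12, 30, 31)
w <- matrix(0, 4, 4)
w[1,2] <- w[2,1] <- w[2,3] <- w[3,2] <- w[3,4] <- w[4,3] <- 1
N <- length(x); W <- sum(w)
num <- (N - 1) * sum(w * outer(x, x, function(a, b) (a - b)^2))
den <- 2 * W * sum((x - mean(x))^2)
C <- num / den
C #below 1: neighbours are similar, i.e. clustering
```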


geary.test(regions_json@data$r_Age,lw)
## 
##  Geary C test under randomisation
## 
## data:  regions_json@data$r_Age 
## weights: lw 
## 
## Geary C statistic standard deviate = 0.87315, p-value = 0.1913
## alternative hypothesis: Expectation greater than statistic
## sample estimates:
## Geary C statistic       Expectation          Variance 
##        0.82089773        1.00000000        0.04207492
#Permutation test for Geary's C statistic
MC<- geary.mc(regions_json@data$r_Age, lw, nsim=599)
MC
## 
##  Monte-Carlo simulation of Geary C
## 
## data:  regions_json@data$r_Age 
## weights: lw 
## number of simulations + 1: 600 
## 
## statistic = 0.8209, observed rank = 122, p-value = 0.2033
## alternative hypothesis: greater

Further Examples using Geary’s C


Getis and Ord’s G

At this point I have considered two global measures. These can be extended and calculated on smaller subsets to produce local measures. However, an intrinsically local measure is Getis and Ord’s \(G\), which gives an indication of clusters based on their values:

\[ G=\frac{\sum_{i=1} \sum_{j=1} w_{i, j} x_{i} x_{j}}{\sum_{i=1} \sum_{j=1} x_{i} x_{j}} \]

where
\(N\) is the number of spatial units indexed by \(i\) and \(j\)
\(x\) is the variable of interest
\(w_{ij}\) is a matrix of spatial weights with zeroes on the diagonal (i.e., \(w_{ii}=0\))

You can see a worked \(G\) example here.


Finally, for completeness, I will mention one last measure: Ripley’s K. This is a descriptive spatial statistic that shows autocorrelation across multiple distance bands. Dr. David Murrel will go over this shortly.

Thus, we have an overview of the core descriptive spatial statistics: Moran’s I, Geary’s C, Getis and Ord’s G, and Ripley’s K.



Gravity Models

Gravity models are used for estimating movements between two places. They are based on Newton’s law of gravitation:

\[ F_{i j}=G \frac{M_{i} M_{j}}{D_{i j}^{2}} \]

where
\(M_{i}\) and \(M_{j}\) are the masses of the two objects
\(G\) is the gravitational constant
\(D_{i j}\) is the distance between the two objects


For movement modelling, Newton’s force is adapted to describe the flow between two places. The masses are replaced by area characteristics: some observations at the origins, \(V_{i}^{\mu}\), and at the destinations, \(W_{j}^{\alpha}\). In the denominator you have distance, which says that the interaction diminishes over space:

\[ T_{i j}= \frac{V_{i}^{\mu} W_{j}^{\alpha}}{d_{i j}^{\beta}} \]

The distance-decay function \(f(d_{ij})\) generally takes the form of a power function: \[ f\left(d_{i j}\right)=d_{i j}^{-\beta} \] or an exponential function:

\[ f\left(d_{i j}\right)=\exp \left(-\beta d_{i j}\right) \] The choice of function and of the constant \(\beta\) will depend on how much value you give to proximity in predicting flows.
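A quick base-R comparison of the two decay forms (the \(\beta\) values are purely illustrative):

```r
#two candidate distance-decay functions, with illustrative beta values
d <- 1:100                    #distances in arbitrary units
power_decay <- d^(-1.5)       #f(d) = d^-beta
exp_decay   <- exp(-0.05 * d) #f(d) = exp(-beta*d)
#the power form drops off sharply near the origin, while with these
#betas the exponential form penalises short trips less
cbind(d = c(1, 10, 100),
      power = power_decay[c(1, 10, 100)],
      exponential = exp_decay[c(1, 10, 100)])
```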


Having chosen a distance-decay function, the unconstrained gravity equation can be expressed as:

\[ T_{i j}=V_{i}^{\mu} W_{j}^{\alpha} f\left(d_{i j}\right) \] or more conveniently (assuming an exponential distance function) as: \[ \ln T_{i j}=k+\mu \ln V_{i}+\alpha \ln W_{j}-\beta d_{i j} \] which can be modelled as a log-linear regression model.
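As a sketch of why the log-linear form is convenient: flows simulated from the gravity equation can be recovered with a Poisson GLM. All names and parameter values below are synthetic, not taken from the census data used later:

```r
#simulate flows from the log-linear gravity equation, then refit them
set.seed(42)
n  <- 200
Vi <- runif(n, 1e4, 1e6) #origin "mass" (synthetic)
Wj <- runif(n, 1e4, 1e6) #destination "mass" (synthetic)
d  <- runif(n, 1, 500)   #distance (synthetic, arbitrary units)
k <- -8; mu <- 0.7; alpha <- 0.5; beta <- 0.01
lambda <- exp(k + mu * log(Vi) + alpha * log(Wj) - beta * d)
Tij <- rpois(n, lambda)
toy_gm <- glm(Tij ~ log(Vi) + log(Wj) + d, family = poisson(link = "log"))
coef(toy_gm) #estimates should sit close to k, mu, alpha and -beta
```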


Getting Distance

crs(regions_json)
## CRS arguments:
##  +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0

Firstly, to get the distances, we need to check our coordinate system. A common mistake is to happily calculate distances between polygons without verifying the units. It is a dangerous one because, in relative terms, the distances might look alright, but they won’t correspond to real-world measures. Conveniently, for the British National Grid (BNG) projection used in the UK, Euclidean distance is in metres.

#this is the BNG: CRS("+init=epsg:27700")
regions_json <-spTransform(regions_json, CRS("+init=epsg:27700"))
dist_m <- spDists(regions_json)
colnames(dist_m) <- regions_json$rgn15cd
rownames(dist_m) <- regions_json$rgn15cd
kable(dist_m)
E12000001 E12000002 E12000003 E12000004 E12000005 E12000006 E12000007 E12000008 E12000009
E12000001 0.0 121184.53 125638.67 243896.0 283721.2 348467.20 409329.80 425126.59 454548.3
E12000002 121184.5 0.00 99867.48 179046.7 176972.4 297039.67 333943.03 341419.79 339525.9
E12000003 125638.7 99867.48 0.00 118641.7 179350.2 224625.87 283966.05 301059.44 354195.8
E12000004 243896.0 179046.68 118641.72 0.0 111376.3 118317.43 165433.86 182984.30 267949.6
E12000005 283721.2 176972.43 179350.17 111376.3 0.0 193444.75 184072.14 179457.38 174904.7
E12000006 348467.2 297039.67 224625.87 118317.4 193444.8 0.00 94220.43 128473.25 290312.9
E12000007 409329.8 333943.03 283966.05 165433.9 184072.1 94220.43 0.00 36136.67 219008.5
E12000008 425126.6 341419.79 301059.44 182984.3 179457.4 128473.25 36136.67 0.00 187368.1
E12000009 454548.3 339525.89 354195.80 267949.6 174904.7 290312.89 219008.52 187368.14 0.0

For further use, you can convert the distances into list form.

#transform the nxn matrix into a pairs list with nxn rows:
dist_pl <- melt(dist_m)
names(dist_pl) <- c('Origin','Destination','Distance_m2') #distances are in metres
kable(head(dist_pl))
Origin Destination Distance_m2
E12000001 E12000001 0.0
E12000002 E12000001 121184.5
E12000003 E12000001 125638.7
E12000004 E12000001 243896.0
E12000005 E12000001 283721.2
E12000006 E12000001 348467.2
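If you prefer to avoid the reshape2 dependency, the same matrix-to-pairs reshape can be done in base R via as.table() (toy matrix shown):

```r
#base-R alternative to melt(): coerce a matrix to long form via as.table()
m <- matrix(c(0, 10, 10, 0), nrow = 2,
            dimnames = list(c("A", "B"), c("A", "B")))
dist_pairs <- as.data.frame(as.table(m), responseName = "Distance_m")
names(dist_pairs)[1:2] <- c("Origin", "Destination")
dist_pairs #4 rows: one per origin-destination pair
```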

od <- c("USUAL_RESIDENCE_NAME","USUAL_RESIDENCE_CODE", "PLACE_OF_WORK_NAME","PLACE_OF_WORK_CODE", "AGE_NAME","OBS_VALUE")
t_wu02Ew_od <-t_wu02Ew[,od]
t_wu02Ew_od <-t_wu02Ew_od[t_wu02Ew$AGE_NAME %in% retired,] #get those "Aged 65-74", "Aged 75+"

t_wu02Ew_od <- aggregate(. ~USUAL_RESIDENCE_CODE + PLACE_OF_WORK_CODE + USUAL_RESIDENCE_NAME + PLACE_OF_WORK_NAME, data=t_wu02Ew_od, sum, na.rm=TRUE)
t_wu02Ew_od$AGE_NAME <- NULL
kable(t_wu02Ew_od[1:20,])
USUAL_RESIDENCE_CODE PLACE_OF_WORK_CODE USUAL_RESIDENCE_NAME PLACE_OF_WORK_NAME OBS_VALUE
E12000006 E12000006 East East 62825
E12000004 E12000006 East Midlands East 788
E12000007 E12000006 London East 2576
E12000001 E12000006 North East East 34
E12000002 E12000006 North West East 73
E12000008 E12000006 South East East 1039
E12000009 E12000006 South West East 127
W92000004 E12000006 Wales East 43
E12000005 E12000006 West Midlands East 94
E12000003 E12000006 Yorkshire and The Humber East 81
E12000006 E12000004 East East Midlands 430
E12000004 E12000004 East Midlands East Midlands 41618
E12000007 E12000004 London East Midlands 111
E12000001 E12000004 North East East Midlands 44
E12000002 E12000004 North West East Midlands 259
E12000008 E12000004 South East East Midlands 315
E12000009 E12000004 South West East Midlands 72
W92000004 E12000004 Wales East Midlands 30
E12000005 E12000004 West Midlands East Midlands 841
E12000003 E12000004 Yorkshire and The Humber East Midlands 751

From these flows, we can get the total emitted from every origin, \(O_i\), and equivalently the total arriving at every destination, \(D_j\):

sum_orig <-  aggregate(t_wu02Ew_od$OBS_VALUE,
                by = list(t_wu02Ew_od$USUAL_RESIDENCE_CODE),
                FUN = sum)
names(sum_orig) <- c('Origin','Oi') #Oi: total flow leaving each origin

sum_dest <-  aggregate(t_wu02Ew_od$OBS_VALUE,
                by = list(t_wu02Ew_od$PLACE_OF_WORK_CODE),
                FUN = sum)
names(sum_dest) <- c('Destination','Dj') #Dj: total flow arriving at each destination

Bringing all the above together we get:

#attach distances and the origin/destination totals to the flow table
t_wu02Ew_od <- merge(t_wu02Ew_od, dist_pl, by.x=c("USUAL_RESIDENCE_CODE", "PLACE_OF_WORK_CODE"), by.y=c("Origin", "Destination"))
t_wu02Ew_od <- merge(t_wu02Ew_od, sum_dest, by.x="PLACE_OF_WORK_CODE", by.y= "Destination")
t_wu02Ew_od <- merge(t_wu02Ew_od, sum_orig, by.x="USUAL_RESIDENCE_CODE", by.y= "Origin")
kable(head(t_wu02Ew_od))
USUAL_RESIDENCE_CODE PLACE_OF_WORK_CODE USUAL_RESIDENCE_NAME PLACE_OF_WORK_NAME OBS_VALUE Distance_m2 Dj Oi
E12000001 E12000001 North East North East 18512 0.0 19158 19262
E12000001 E12000004 North East East Midlands 44 243896.0 46090 19262
E12000001 E12000008 North East South East 47 425126.6 107425 19262
E12000001 E12000007 North East London 79 409329.8 74498 19262
E12000001 E12000002 North East North West 75 121184.5 67584 19262
E12000001 E12000006 North East East 34 348467.2 71042 19262

With the theoretical grounding, the distance calculation example and the correctly formatted data from above, you have all you need to explore the different families of gravity models. The most basic is the unconstrained version, fitted as a log-linear Poisson regression model as mentioned above:

#do not consider within flows
t_wu02Ew_od <- t_wu02Ew_od[t_wu02Ew_od$Distance_m2 != 0,]

#note: log(Distance_m2) acts as a power-law distance-decay term; I have not
#tuned or compared decay functions here, for simplicity of the example
unconstrained_gm <- glm(OBS_VALUE ~ log(Oi)+log(Dj)+log(Distance_m2), na.action = na.exclude, family = poisson(link = "log"), data = t_wu02Ew_od)
#let's have a look at its summary...
summary(unconstrained_gm)
## 
## Call:
## glm(formula = OBS_VALUE ~ log(Oi) + log(Dj) + log(Distance_m2), 
##     family = poisson(link = "log"), data = t_wu02Ew_od, na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -60.133  -11.135   -5.893    1.386  100.783  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      11.76801    0.32101   36.66   <2e-16 ***
## log(Oi)           0.61235    0.01643   37.26   <2e-16 ***
## log(Dj)           0.56392    0.01572   35.87   <2e-16 ***
## log(Distance_m2) -1.54905    0.00801 -193.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 94447  on 71  degrees of freedom
## Residual deviance: 23912  on 68  degrees of freedom
## AIC: 24432
## 
## Number of Fisher Scoring iterations: 5
rsq(unconstrained_gm)
## [1] 0.5649217
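Note that rsq() is only one of several pseudo-R-squared definitions for GLMs; a common alternative, the deviance-based pseudo-R-squared, can be read straight off the summary output above:

```r
#deviance-based pseudo R-squared: 1 - residual deviance / null deviance
#(deviance values taken from the model summary above)
1 - 23912 / 94447
```

This gives roughly 0.75, a different (here more generous) figure than rsq()'s variance-function-based value; neither is "the" R-squared for a Poisson model.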

Given that the data is in matrix form, there are ways of taking the row and column totals into account to calibrate the model. At a high level, these are a ratio (or percentage) for each cell that reconciles the observed row and column totals with what we would get if we only considered the distance (or, more generally, the cost) associated with that particular type of flow. Note I say flow, not cell, because moves between some groups of places may be assumed to carry the same cost and would therefore use the same calibrating ratio.
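One standard way to realise those balancing ratios is iterative proportional fitting (IPF); this is a sketch of the idea, not necessarily what any particular package implements. Starting from a seed matrix of distance-based flows, alternately rescale rows and columns until the margins match the observed \(O_i\) and \(D_j\) totals (toy numbers, with matching grand totals):

```r
#iterative proportional fitting (toy sketch): scale a seed matrix of
#distance-based flows until row sums match Oi and column sums match Dj
ipf <- function(seed, Oi, Dj, iters = 100) {
  m <- seed
  for (i in seq_len(iters)) {
    m <- m * (Oi / rowSums(m))              #match origin totals
    m <- sweep(m, 2, Dj / colSums(m), "*")  #match destination totals
  }
  m
}
seed <- matrix(c(4, 2, 1, 3), nrow = 2) #e.g. exp(-beta*d), toy numbers
balanced <- ipf(seed, Oi = c(10, 20), Dj = c(12, 18))
round(rowSums(balanced), 6) #10 20
```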

Wölwer, Burgard and Breßlein go into this in more detail, whilst discussing a fairly new R package for trade gravity modelling.

Further reading: Gravity models of Trade in R
Unconstrained Model of Migration


Take-aways:

  1. Understand, access and plot UK geographic data in a meaningful projection.

  2. Understand the different spatial statistics used to describe patterns in space.

  3. Understand the basics of gravity modelling with regard to its programmatic data implementation.